# Project: Deep Learning on Field Photography Reveals the Morphometric Diversity of Colombian Freshwater Fish
![Deep Learning on Field Photography Reveals the Morphometric Diversity of Colombian Freshwater Fish pipeline](Workflow/Workflow,%20step%202%20&%204/Workflow/Workflow.jpeg)




This repository contains four main modules developed for the automated analysis of fish images. Each module is responsible for a specific task:

1. **Workflow, Step 1: Segmentation of Fish Images:**  
   Utilizes deep learning models (Grounding-DINO and Segment Anything) to detect and segment regions of interest (fish) in the images. Batch inference is performed, and a CSV file with bounding box coordinates is generated.

2. **Workflow, Step 2: Morphometric Validation:**  
   After segmenting the regions of interest, this module evaluates the accuracy of the segmentations generated by the Segment Anything model. It uses standard metrics such as Intersection over Union (IoU) and the Dice coefficient. Additionally, the area (pixel count) and perimeter (pixel length) of the segmented fish are compared with those obtained from manual annotations to assess morphometric precision.

3. **Workflow, Step 3: Morphometric Descriptor Extraction:**  
   Once the regions have been segmented, this module computes several morphometric descriptors such as area, perimeter, Hu moments, and Zernike moments. These descriptors are essential for quantitative and comparative analyses of Colombian Freshwater Fish.

4. **Workflow, Step 4: Morphometric Diversity Analysis:**  
   Using the outputs from Descriptor Extraction, this module analyzes morphometric variation among Colombian Freshwater Fish to explore the distribution of species within the morphospace.

---

## Table of Contents

- [Requirements and Dependencies](#requirements-and-dependencies)
- [Environment Setup](#environment-setup)
- [Workflow, Step 1: Image Segmentation](#workflow-step-1-image-segmentation)
  - [Description and Features](#description-and-features)
  - [Usage Instructions](#usage-instructions)
- [Workflow, Step 2: Morphometric Validation](#workflow-step-2-morphometric-validation)
  - [Description and Features](#description-and-features-1)
  - [Usage Instructions](#usage-instructions-1)
- [Workflow, Step 3: Morphometric Descriptor Extraction](#workflow-step-3-morphometric-descriptor-extraction)
  - [Description and Features](#description-and-features-2)
  - [Usage Instructions](#usage-instructions-2)
- [Workflow, Step 4: Morphometric Diversity Analysis](#workflow-step-4-morphometric-diversity-analysis)
  - [Description and Features](#description-and-features-3)
  - [Usage Instructions](#usage-instructions-3)
- [Directory Structure](#directory-structure)
- [Additional Notes and Considerations](#additional-notes-and-considerations)

---

## Requirements and Dependencies

Running the Python-based workflows (Steps 1 and 3) requires the following packages and tools:

- **Python 3.x**  
- **PyTorch and Torchvision:** For loading models and tensor processing.  
- **TensorFlow:** Used for GPU verification and usage.  
- **OpenCV and Matplotlib:** For image processing and visualization.  
- **GroundingDINO and Segment Anything:** Image segmentation tools (installed via `pip` and GitHub).  
- **Transformers (by Hugging Face):** For handling and loading segmentation models.  
- **Mahotas:** For calculating Zernike moments in the morphometric descriptor extraction.  
- **Pandas and NumPy:** For data manipulation and numerical operations.  

Additionally, for the R-based workflows (Steps 2 and 4), the following R packages are required:

- **ggplot2**: For plotting and visualization.  
- **reshape2**: For data transformation.  
- **dplyr**: For data manipulation.  
- **fitdistrplus**: For distribution fitting.  
- **readr**: For reading data files.  
- **Metrics**: For model performance metrics.  
- **here**: For managing relative paths.  
- **gridExtra**: For enhanced graphical layouts.  
- **corrplot**: For correlation matrix visualization.  
- **funspace**: For morphospace analysis.  
- **factoextra**: For PCA visualization and extraction.  
- **tidyr**: For data reshaping.  
- **psych**: For descriptive statistics and psychometric analysis.

> **Note:** Some dependencies are installed with shell commands. In a Colab cell, prefix them with `!` (e.g., `!pip install`, `!wget`); in a terminal, run them without the `!`.

---

## Environment Setup

The Python notebooks are designed to run in environments like Google Colab, using a mounted Google Drive to store and retrieve data; the RMarkdown scripts run locally (e.g., in RStudio). For the notebooks, you must mount your drive and configure absolute paths for:
- Configuration files and checkpoints.
- Input directories containing images and output directories for results.
- The specific inference script (`fish_inference.py`).

GPU availability is checked through both TensorFlow and PyTorch so that inference uses the GPU when one is present.

---

## Workflow, Step 1: Image Segmentation

### Description and Features

This module performs the following tasks:
- **Environment Verification:**  
  Prints the versions of PyTorch and Torchvision, and checks for CUDA availability.
- **Dependency Installation:**  
  Installs libraries such as `groundingdino-py`, `opencv-python`, `matplotlib`, and the Segment Anything repository.
- **Google Drive Mounting:**  
  Enables access to data stored on Google Drive.
- **Batch Inference Execution:**  
  - Iterates through multiple folders (specified in the `images` list) containing the images to be processed.
  - For each folder, runs the Grounding-DINO inference script (`fish_inference.py`) with the configuration file, checkpoint, and prediction thresholds, and selects the desired output mode (saving images, a CSV file, or both).
  - A CSV file is generated for each folder, containing predictions in the form of bounding boxes.
- **CSV File Cleaning:**  
  Processes the CSV file to remove unwanted characters from columns containing prediction data.
- **Segmentation with SAM:**  
  - Loads a pre-trained Segment Anything (SAM) model using a checkpoint.
  - Initializes a `SamPredictor` object to perform segmentation predictions on images using bounding boxes.
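
The CSV cleaning and box hand-off described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code. It assumes that each prediction cell holds a box as a string with wrapper characters (for example `tensor([...])`) and that the four values are normalized `(cx, cy, w, h)`, which is GroundingDINO's native box format. `SamPredictor` expects box prompts in pixel XYXY format, so a conversion is needed under that assumption. The function names `parse_box` and `cxcywh_to_xyxy` are hypothetical.

```python
import re

# Matches plain and scientific-notation numbers inside a messy string.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_box(raw):
    """Drop wrapper characters such as 'tensor([...])' and keep the first four numbers.

    Assumption: the cell stores exactly one box; anything after the fourth
    number (e.g. a device suffix) is ignored.
    """
    values = [float(token) for token in NUMBER.findall(raw)]
    if len(values) < 4:
        raise ValueError(f"expected four box values, got: {raw!r}")
    return values[:4]


def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]


if __name__ == "__main__":
    cleaned = parse_box("tensor([0.5, 0.5, 0.2, 0.4])")
    print(cxcywh_to_xyxy(cleaned, width=100, height=200))
```

If the CSV already stores pixel XYXY coordinates, only the cleaning step applies and the conversion should be skipped.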

### Usage Instructions

1. **Environment Preparation:**  
   - Mount your Google Drive (or adjust local paths).
   - Ensure all the dependencies listed are installed (using `pip install` and `wget` commands as needed).
2. **Download Weights:**  
   - Download the required [model weights](https://drive.google.com/file/d/1uR0qmi9JmnGoGg6721I79Sg34QnJcw5Z/view?usp=sharing).
3. **Path Configuration:**  
   - Define absolute paths in variables such as `conf_path` and `base_dir`.
   - Update the `images` list with the names of the image folders.
4. **Execution:**  
   - The script iterates through each image folder, runs the inference, and cleans the resulting CSV file.
   - It then uses the SAM predictor to generate segmentation masks, displays results, and saves the predictions as `.npy` files.
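
The path configuration in step 3 might look like the sketch below. It reuses the variable names `conf_path`, `base_dir`, and `images` from the instructions; everything else is an assumption made for illustration, including the example Drive paths, the `output_dir` argument, the one-CSV-per-folder naming, and the helper names `build_jobs` and `missing_paths`. Checking paths before launching inference helps catch a missing Drive mount early.

```python
from pathlib import Path


def build_jobs(base_dir, conf_path, images, output_dir):
    """Return one job description per image folder listed in `images`."""
    base, conf, out = Path(base_dir), Path(conf_path), Path(output_dir)
    return [
        {
            "folder": base / name,          # input images for this campaign
            "config": conf,                 # shared configuration file
            "csv": out / f"{name}.csv",     # assumed naming: one CSV per folder
        }
        for name in images
    ]


def missing_paths(jobs):
    """List input folders and config files that do not exist on disk."""
    missing = []
    for job in jobs:
        for key in ("folder", "config"):
            if not job[key].exists() and job[key] not in missing:
                missing.append(job[key])
    return missing


if __name__ == "__main__":
    # Example values only; adjust to your own Drive or local layout.
    root = "/content/drive/MyDrive/Dataset_CavFish_SAM"
    jobs = build_jobs(
        base_dir=f"{root}/CavFish Colombia",
        conf_path=f"{root}/conf/cfg_coco.py",
        images=["2016 Rio Bita", "2019 Amoya"],
        output_dir=f"{root}/output_folder",
    )
    for path in missing_paths(jobs):
        print("Missing:", path)
```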

---

## Workflow, Step 2: Morphometric Validation

### Description and Features

This module focuses on validating the segmentation and assessing morphometric errors using R. It is divided into two primary parts:

1. **Segmentation Performance and Statistical Error Assessment:**
   - **Loading Data and Libraries:**  
     Imports necessary R libraries (e.g., `ggplot2`, `reshape2`, `dplyr`, etc.) and reads a CSV file containing segmentation metrics (IoU & Dice).
   - **Data Preparation:**  
     Reshapes the data, filters metrics (values ≥ 0.8), and computes descriptive statistics (mean, standard deviation, and thresholds for ±1, ±2, ±3 SD).
   - **Visualization:**  
     Generates a violin plot that shows the distribution of the segmentation metrics with reference lines indicating the mean and standard deviation thresholds. The plot is saved in the `figures` folder.

2. **Morphometric Validation:**
   - **Loading Morphometric Data:**  
     Reads a CSV file containing observed and predicted morphometric values (e.g., area and perimeter).
   - **Error Annotation:**  
     Assigns error categories (e.g., "Segmentation error" or "Morphological extraction error") to specific images based on known error cases.
   - **Validation Metrics Calculation:**  
     Computes metrics such as RMSE, R², and RPD for each morphological parameter.
   - **Graph Generation:**  
     Creates scatter plots comparing observed versus predicted values, with identity lines and metric annotations. The plots are saved for further review.
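
For readers who want to see how the segmentation metrics are defined, here is a dependency-free Python sketch of IoU, the Dice coefficient, and the mean ± k·SD reference lines. It is an illustration only; the repository performs this analysis in R. Masks are represented as rows of 0/1 values, the function names are hypothetical, and treating two empty masks as perfect agreement is a convention chosen here.

```python
from statistics import mean, stdev


def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two equally sized binary masks given as rows of 0/1."""
    intersection = union = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t
            pred_area += p
            truth_area += t
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    return intersection / union, 2 * intersection / (pred_area + truth_area)


def sd_thresholds(scores, ks=(1, 2, 3)):
    """Return {k: (mean - k*SD, mean + k*SD)} using the sample standard deviation."""
    centre, spread = mean(scores), stdev(scores)
    return {k: (centre - k * spread, centre + k * spread) for k in ks}


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    manual = [[1, 0], [0, 0]]
    print(iou_and_dice(predicted, manual))

    scores = [0.95, 0.91, 0.62, 0.88, 0.97]
    kept = [s for s in scores if s >= 0.8]  # same cut-off as the filtering step
    print(sd_thresholds(kept))
```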

### Usage Instructions

1. **Prepare Data:**  
   - Ensure that the CSV files (`segmentation_metrics (IoU & Dice).csv` and `segmentation_metrics (Obs & Pred).csv`) are placed in the `data` folder.
2. **Run the RMarkdown Script:**  
   - Open the RMarkdown file (e.g., `Workflow, Step 2: Morphometric Validation.Rmd`) in RStudio.
   - Knit the document to generate outputs in PDF or HTML format.
3. **Inspect Results:**  
   - Review the generated plots and statistical summaries, which are saved in the `figures` folder.
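
The validation metrics reported by the script (RMSE, R², and RPD) can be reproduced outside R for a quick sanity check. The sketch below is written under stated assumptions: R² is taken as the coefficient of determination, 1 − SS_res/SS_tot (the script may instead use the squared correlation), and RPD is taken as the ratio of the observed values' sample standard deviation to the RMSE. The function name `validation_metrics` is hypothetical.

```python
import math
from statistics import mean, stdev


def validation_metrics(observed, predicted):
    """Return RMSE, R² and RPD for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / len(residuals))

    obs_mean = mean(observed)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot  # assumption: coefficient of determination

    # Assumption: RPD = SD(observed) / RMSE; infinite when the fit is exact.
    rpd = stdev(observed) / rmse if rmse > 0 else math.inf
    return {"rmse": rmse, "r2": r2, "rpd": rpd}


if __name__ == "__main__":
    # Toy values, not taken from the project's data.
    manual_area = [1.0, 2.0, 3.0]
    sam_area = [1.0, 2.0, 4.0]
    print(validation_metrics(manual_area, sam_area))
```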

---

## Workflow, Step 3: Morphometric Descriptor Extraction

### Description and Features

This module analyzes the segmentation masks generated in Step 1 and extracts key morphometric descriptors, including:
- **Center of Mass and Image Center:**  
  Calculated using image moments and geometric methods.
- **Area and Perimeter:**  
  Measured using OpenCV functions to capture contour characteristics.
- **Hu Moments:**  
  Extracted to provide invariant shape descriptors.
- **Zernike Moments:**  
  Computed with the Mahotas library for detailed shape analysis.
- **Visualization:**  
  Overlays of the original mask, detected contours, and key points are produced for visual verification.
- **Result Storage:**  
  Exports a CSV file with all computed descriptors for each image.
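
To make the moment-based descriptors concrete, the sketch below computes a pixel-count area, the center of mass from raw image moments, the image center, and the first Hu invariant from a binary mask. It is a dependency-free illustration rather than the notebook's code, which relies on OpenCV and Mahotas. The mask is given as rows of 0/1 values, the image center is assumed to be `(width / 2, height / 2)`, and the function name `mask_descriptors` is hypothetical. Contour-based area from OpenCV can differ slightly from a pixel count.

```python
def mask_descriptors(mask):
    """Return basic shape descriptors for a binary mask given as rows of 0/1."""
    height, width = len(mask), len(mask[0])
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]

    m00 = len(pixels)  # zeroth raw moment = pixel-count area
    if m00 == 0:
        raise ValueError("mask contains no foreground pixels")

    # Center of mass: first raw moments divided by the area.
    cx = sum(x for x, _ in pixels) / m00
    cy = sum(y for _, y in pixels) / m00

    # Second-order central moments, then scale normalization
    # eta_pq = mu_pq / m00 ** (1 + (p + q) / 2), which is m00 ** 2 for p + q = 2.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels)
    mu02 = sum((y - cy) ** 2 for _, y in pixels)
    hu1 = (mu20 + mu02) / m00 ** 2  # first Hu invariant: eta20 + eta02

    return {
        "area": m00,
        "center_of_mass": (cx, cy),
        "image_center": (width / 2, height / 2),  # assumed convention
        "hu1": hu1,
    }


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_descriptors(toy_mask))
```

Because the first Hu invariant is built from normalized central moments, it does not change when the shape is translated or uniformly scaled, which is what makes these descriptors comparable across photographs.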

### Usage Instructions

1. **Environment Preparation:**  
   - Verify that segmentation masks (e.g., `.npy` files) have been generated by the segmentation module.
   - Configure paths correctly (mount Google Drive or set local paths).
2. **Execution:**  
   - Run the descriptor extraction script (e.g., `Morphometric_descriptor_extraction.ipynb`) in your preferred Python environment.
3. **Output Inspection:**  
   - Review the visual overlays and check the exported CSV files containing the morphometric descriptors.

---

## Workflow, Step 4: Morphometric Diversity Analysis

### Description and Features

This module conducts an analysis of morphometric diversity using R. It consists of several key steps:

1. **Data Loading and Inspection:**  
   - Imports a CSV file (`PCA_variables.csv`) containing various morphometric variables (Hu moments, Zernike moments, area, perimeter, diameter, etc.).
   - Converts grouping variables to factors for subsequent analysis.

2. **Correlation Analysis:**  
   - **Data Transformation:**  
     Applies logarithmic transformation and scaling to selected variables.
   - **Visualization:**  
     Generates density plots for scaled variables and computes a correlation matrix, which is visualized using `corrplot`.

3. **Principal Component Analysis (PCA):**  
   - **Execution:**  
     Performs PCA on a subset of variables to extract the principal components.
   - **Graphical Outputs:**  
     Produces a biplot and a scree plot (with significance thresholds) to visualize the data structure and variance explained by each component.

4. **Morphospace Analysis:**  
   - **Morphospace Definition:**  
     Uses the `funspace` package to define and visualize the overall morphospace as well as group-specific spaces by taxonomic order.
   - **Detailed Visualization:**  
     Generates plots showing both the morphospace and segmented views by order, highlighting morphological differences among Colombian Freshwater Fish.
   - **Exporting Results:**  
     All figures are saved in the `figures` folder.
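
The transformation and correlation steps that precede the PCA can be sketched in a few lines. This Python version is only an illustration of the idea; the actual analysis is done in R. It assumes a natural-log transform followed by centering and scaling with the sample standard deviation (the behavior of R's `scale()`), and it requires strictly positive inputs. The function names and the toy values are hypothetical.

```python
import math
from statistics import mean, stdev


def log_scale(values):
    """Log-transform, then center and scale to unit sample standard deviation."""
    logged = [math.log(v) for v in values]  # requires strictly positive values
    centre, spread = mean(logged), stdev(logged)
    return [(v - centre) / spread for v in logged]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between transformed columns."""
    scaled = {name: log_scale(values) for name, values in columns.items()}
    return {
        a: {b: pearson(scaled[a], scaled[b]) for b in scaled}
        for a in scaled
    }


if __name__ == "__main__":
    # Toy values, not taken from PCA_variables.csv.
    table = {
        "area": [120.0, 450.0, 980.0, 300.0],
        "perimeter": [48.0, 95.0, 140.0, 80.0],
    }
    for name, row in correlation_matrix(table).items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

Log-transforming before scaling reduces the influence of the largest specimens, so that size variables such as area and perimeter do not dominate the first principal component purely through their range.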

### Usage Instructions

1. **Prepare Data:**  
   - Place the `PCA_variables.csv` file in the `data` folder.
2. **Run the RMarkdown Script:**  
   - Open the RMarkdown file (e.g., `Workflow, Step 4: Morphometric Diversity Analysis.Rmd`) in RStudio.
   - Knit the document to produce outputs in PDF or HTML format.
3. **Review Outputs:**  
   - Examine the density plots, correlation matrix, PCA biplot, scree plot, and morphospace visualizations saved in the `figures` folder.

---

## Directory Structure

The suggested directory structure is as follows:

```plaintext
Dataset_CavFish_SAM/
├── conf/                          
│   ├── fish_inference.py               # Inference script for segmentation
│   └── cfg_coco.py                     # Configuration file for segmentation
├── CavFish Colombia/             # Folder containing input images
│   ├── 2016 Rio Bita/
│   ├── 2017 AndesCol/
│   ├── 2018 AndesCol/
│   ├── 2018 Caguan The Field Museum/
│   ├── 2018 Guayavero Duda/
│   ├── 2018 Manacacias Palmarito/
│   ├── 2018 Peces Ituango/
│   ├── 2018 Yungillo/
│   ├── 2019 Amoya/
│   ├── 2019 Inirida/
│   ├── 2019 Rio Vaupes/
│   ├── 2022 2023 General/
│   └── Manual annotation (CVAT)/
├── output_folder/                      # Folder to save output images, CSV files, and predictions
├── Workflow/
│   ├── Workflow, step 1 & 3/
│   │   ├── 1. segmentation.ipynb                          # Jupyter Notebook for image segmentation
│   │   └── 3. Morphometric_descriptor_extraction.ipynb    # Notebook for descriptor extraction
│   ├── Workflow, step 2 & 4/
│   │   ├── data/
│   │   │   ├── PCA_variables.csv                          # Variables used for PCA and morphometric diversity
│   │   │   ├── segmentation_metrics (IoU & Dice).csv      # Segmentation performance metrics data
│   │   │   └── segmentation_metrics (Obs & Pred).csv      # Observed vs. predicted metrics for morphometric validation
│   │   └── Workflow/
│   │       ├── 2. Morphometric Validation.jpeg            # Workflow image for morphometric validation
│   │       └── 4. Morphometric diversity analysis.jpeg    # Workflow image for morphometric diversity analysis
│   ├── figures/
│   │   ├── Histogram_Density_log222.jpg                   # Density plot for segmentation metrics
│   │   ├── Scatter_Area_Observed_vs_Predicted.jpg         # Scatter plot for Area (Observed vs. Predicted)
│   │   ├── Scatter_Perimeter_Observed_vs_Predicted.jpg    # Scatter plot for Perimeter (Observed vs. Predicted)
│   │   ├── Correlation_Analysis.jpg                       # Correlation matrix plot
│   │   ├── Biplot_PCA.jpg                                 # Biplot from PCA analysis
│   │   ├── Enhanced_Scree_Plot.jpg                        # Scree plot with significant PCA components
│   │   ├── PCA_Biplot.jpg                                 # Enhanced PCA biplot (Morphospace visualization)
│   │   ├── Colombia freshwater fish_morphospace_PC1-PC2.jpg  # Global morphospace plot
│   │   └── Colombian Morphospace by Order_PC1-PC2.jpg     # Morphospace plots segregated by Order
│   └── scripts/
│       ├── morphometric_validation.Rmd                    # R Markdown script for morphometric validation workflow
│       └── morphometric_diversity_analysis.Rmd            # R Markdown script for morphometric diversity analysis workflow
└── README.md                           # Project documentation
```
Researchers requiring these images for analysis may request them by contacting CavFish at cavfish@unibague.edu.co.

---

## Additional Notes and Considerations
  
- **Compatibility:**  
  Although the code is designed for Google Colab, it can be easily adapted for local execution by modifying the path configurations and mounting settings.
